Reproducible Research: who, what, where, why, when & how

SESYNC Computational Summer Institute, July 2015

Overview

  • Motivation & context
  • Concepts and vocabulary
  • General principles
  • Survey of the tool landscape
  • Dissemination example with R Shiny

Who is reproducible research for?

Don’t let this be you!

Who is reproducible research for?

  • you, now and in the future
  • collaborators
  • reviewers & editors
  • funding agencies

What is reproducible research?

Access, understanding, sharing

“The goal of reproducible research is to tie specific instructions to data analysis and experimental data so that scholarship can be recreated, better understood and verified.” - Max Kuhn, CRAN Task View: Reproducible Research



i.e., raw data + instructions

What to share

Archive

  • starting dataset
  • metadata
  • data cleaning steps
  • analysis scripts
  • source code
  • readme

Share maybe

  • raw data
  • processed/cleaned data
  • intermediate results

What NOT to share

  • confidential data
  • copyrighted material
  • material covered by pre-existing restrictive licenses
  • your passwords and private keys

How to choose the appropriate repository?

  • is there a domain specific repository?
  • what are the backup & replication policies?
  • is there a plan for long-term preservation?
  • can people find your materials?
  • is it citable (does it provide DOIs)?
  • is your purpose archival, sharing or publication?

Why reproducible research?

Why?

Sticks: requirements you have to meet

  • journal policies
  • NSF/funding agency policies
  • Congressional mandates

Why?

Increased visibility and citation

Piwowar & Vision (2013) “Data reuse and the open data citation advantage.” PeerJ 1: e175

https://peerj.com/articles/175/

Figure 1: Citation density for papers with and without publicly available microarray data, by year of study publication.

Better research

Wicherts et al. (2011) “Willingness to Share Research Data Is Related to the Strength of the Evidence and the Quality of Reporting of Statistical Results.” PLoS ONE 6(11): e26828

Figure 1. Distribution of reporting errors per paper for papers from which data were shared and from which no data were shared.

More efficient, less redundant science

When to think about reproducibility?

  • now
  • before you start a project
  • at publication

File organization: a mighty weapon against chaos

A good project layout helps ensure:

  • integrity of the data
  • portability of the project
  • ease of picking the project back up after a break
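One way to put these goals into practice is to create the same skeleton for every project. The sketch below is only an illustration: the folder names (data/raw, data/processed, scripts, results, docs) and the helper functions create_project and protect_raw_file are assumptions, not a layout prescribed in these slides. It keeps raw data in its own folder, marks a raw file read-only to protect its integrity, and uses only paths relative to the project root so the project stays portable.

```python
import stat
import tempfile
from pathlib import Path

# Hypothetical folder names; adapt them to your own conventions.
LAYOUT = ["data/raw", "data/processed", "scripts", "results", "docs"]


def create_project(root: Path) -> Path:
    """Create the folder skeleton and a stub README under `root`."""
    for folder in LAYOUT:
        (root / folder).mkdir(parents=True, exist_ok=True)
    readme = root / "README.txt"
    if not readme.exists():
        readme.write_text(
            "Describe the project, the data sources, and how to rerun the analysis.\n"
        )
    return root


def protect_raw_file(path: Path) -> None:
    """Mark a raw data file read-only so it is not edited by accident."""
    path.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)


# Demonstration in a temporary directory (nothing is written to your own files).
project = create_project(Path(tempfile.mkdtemp()) / "my-project")
raw = project / "data" / "raw" / "2015-07-09_01_site-a-counts.csv"
raw.write_text("site,count\nA,3\n")
protect_raw_file(raw)

# List everything relative to the project root, which is what keeps it portable.
for entry in sorted(p.relative_to(project).as_posix() for p in project.rglob("*")):
    print(entry)
```

Scripts would then read from data/raw and write to data/processed or results, never the other way around.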

Help yourself find and use your files again

  • Machine readable: use delimiters deliberately; avoid spaces, punctuation, and accented characters
  • Human readable: the name contains information about the content
  • Default ordering: put something numeric first, use the ISO 8601 standard for dates (YYYY-MM-DD), left-pad numbers with zeros
  • File formats: use open, plain-text formats such as .csv and .txt rather than Word, Excel, PDF, or image files
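The naming rules above can be sketched as a small helper. This is a minimal illustration, not a standard: the pattern date_sequence_description.ext, the choice of "_" between fields and "-" within the description, and the function names slugify and make_filename are all assumptions made for this example.

```python
import re
import unicodedata
from datetime import date


def slugify(text: str) -> str:
    """Lowercase, strip accents, and replace runs of other characters with '-'."""
    ascii_text = (
        unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def make_filename(day: date, sequence: int, description: str, extension: str = "csv") -> str:
    """Build 'YYYY-MM-DD_NN_description.ext' (hypothetical pattern).

    '_' separates fields and '-' separates words, so names are easy to split;
    the ISO 8601 date and zero-padded number make alphabetical order match
    chronological order.
    """
    return f"{day.isoformat()}_{sequence:02d}_{slugify(description)}.{extension}"


names = [
    make_filename(date(2015, 7, 21), 10, "Site B résumé of counts"),
    make_filename(date(2015, 7, 9), 2, "Site A raw counts"),
]

# Plain alphabetical sorting now yields chronological order.
for name in sorted(names):
    print(name)
# 2015-07-09_02_site-a-raw-counts.csv
# 2015-07-21_10_site-b-resume-of-counts.csv
```

Because the fields are delimited consistently, a name can be taken apart again with name.split("_"), which is what makes it machine readable.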

[Placeholder: picture with example file names]

Tools for reproducible research

Shiny example

More references & resources